Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms

ABSTRACT

Methods and systems for keyword spotting, i.e., for identifying textual phrases of interest in input data. The input data may be communication packets exchanged in a communication network. A keyword spotting system holds a dictionary (or dictionaries) of textual phrases for searching input data. The input data and the patterns are assigned to multiple different pattern matching algorithms. For example, a dominant share of the traffic is handled by one algorithm and smaller traffic shares may be handled by the others. The system monitors the algorithms' performance as they process the data to search for a match. The ratio of traffic splitting among the algorithms is dynamically reassigned or adjusted to maximize the overall performance.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of, and claims the benefit of priority to, U.S. patent application Ser. No. 14/263,108, entitled “SYSTEMS AND METHODS FOR KEYWORD SPOTTING USING ADAPTIVE MANAGEMENT OF MULTIPLE PATTERN MATCHING ALGORITHMS,” filed Apr. 28, 2014, whose disclosure is incorporated by reference herein.

FIELD OF THE DISCLOSURE

The present disclosure relates generally to data processing, and particularly to methods and systems for detecting strings in data.

BACKGROUND OF THE DISCLOSURE

Keyword searching techniques are used in a wide variety of applications. For example, in some applications, communication traffic is analyzed in an attempt to detect keywords that indicate traffic of interest. Some data security systems attempt to detect information that leaks from an organization network by detecting keywords in outgoing traffic. Intrusion detection systems sometimes identify illegitimate intrusion attempts by detecting keywords in traffic. Various keyword searching techniques are known in the art. For example, Aho and Corasick describe an algorithm for locating occurrences of a finite number of keywords in a string of text, in “Efficient String Matching: An Aid to Bibliographic Search,” Communications of the ACM, volume 18, no. 6, June, 1975, pages 333-340, which is incorporated herein by reference. This technique is commonly known as the Aho-Corasick algorithm. As another example, Yu et al. describe a multiple-pattern matching scheme, which uses Ternary Content-Addressable Memory (TCAM), in “Gigabit Rate Packet Pattern-Matching using TCAM,” Proceedings of the 12th IEEE International Conference on Network Protocols (ICNP), Berlin, Germany, Oct. 5-8, 2004, pages 174-183, which is incorporated herein by reference.

Other string matching algorithms are described, for example, by Navarro and Raffinot, in “Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences,” Cambridge University Press, 2002, which is incorporated herein by reference. Chapter 3 of this book reviews multiple string matching algorithms such as the Wu-Manber (WM) and the Set Backward Oracle Matching (SBOM) algorithms.

SUMMARY OF THE DISCLOSURE

An embodiment that is described herein provides a method, including receiving input data to be searched for occurrences of a set of patterns, assigning the input data and the patterns to multiple different pattern matching algorithms, searching the input data using the pattern matching algorithms, evaluating a predefined metric, and reassigning the input data and the patterns to the pattern matching algorithms based on the evaluated metric.

In some embodiments, evaluating the predefined metric includes assessing a performance measure of the pattern matching algorithms. In other embodiments, evaluating the predefined metric includes assessing a characteristic of the input data. In yet other embodiments, assigning the input data and the patterns includes applying each of the pattern matching algorithms to search a respective subset of the input data for the occurrences of all the patterns.

In some embodiments, reassigning the input data and the patterns includes reassigning a portion of the input data from a first pattern matching algorithm to a second pattern matching algorithm.

In other embodiments, assigning the input data and the patterns includes defining one of the pattern matching algorithms as a primary algorithm and assigning a majority of the input data to the primary algorithm, and reassigning the input data and the patterns includes redefining another of the pattern matching algorithms to serve as the primary algorithm and shifting the majority of the input data to the redefined primary algorithm.

In yet other embodiments, assigning the input data and the patterns includes applying each of the pattern matching algorithms to search all the input data for the occurrences of a respective subset of the patterns.

In an embodiment, evaluating the metric includes evaluating at least one metric type selected from a group of types consisting of: a volume of the input data processed by a given pattern matching algorithm per unit time; a memory size occupied by the assigned patterns; and the memory size used for maintaining state machines of respective flows of the input data.

There is also provided, in accordance with an embodiment that is described herein, an apparatus including an input circuit and a processor. The input circuit is configured to receive input data to be searched for occurrences of a set of patterns. The processor is configured to assign the input data and the patterns to multiple different pattern matching algorithms, to search the input data for the occurrences using the multiple algorithms, to evaluate a predefined metric, and to reassign the input data and the patterns to the pattern matching algorithms based on the evaluated metric.

The present disclosure will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram that schematically illustrates a system for keyword searching, in accordance with an embodiment of the present disclosure;

FIGS. 2 and 3 are block diagrams that schematically illustrate configurations of a processor in a system for keyword searching, in accordance with embodiments of the present disclosure; and

FIG. 4 is a flow chart that schematically illustrates a method for efficient keyword searching, in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS

Overview

Embodiments that are described herein provide improved methods and systems for keyword spotting, i.e., for identifying textual phrases of interest in input data. In the embodiments described herein, the input data comprises communication packets exchanged in a communication network. The disclosed keyword spotting techniques can be used, for example, in applications such as Data Leakage Prevention (DLP), Intrusion Detection Systems (IDS) or Intrusion Prevention Systems (IPS), and spam e-mail detection.

In the disclosed embodiments, a keyword spotting system holds a dictionary (or dictionaries) of textual phrases for searching input data. In a communication analytics system, for example, the dictionary defines textual phrases to be located in communication packets, such as e-mail addresses or Uniform Resource Locators (URLs).

In some applications, the dictionary comprises a large number of textual phrases, e.g., on the order of thousands or more, which may differ in size from one another. Each textual phrase in the dictionary typically comprises a string of characters, and in some embodiments may comprise various wildcard characters. Moreover, the dictionary may change over time, e.g., textual phrases may be added, deleted or modified. In the description that follows, the textual phrases are also referred to as keywords or patterns.

The performance of algorithms for keyword searching (also referred to as pattern matching algorithms) may be affected by many factors. Example factors include the dictionary size, the alphabet size (i.e., the number of different characters in the data), the sizes (or the minimal size) of the searched patterns, and the characteristics of the input data. In addition, an algorithm may suffer an attack (sometimes referred to as a “pattern matching algorithmic complexity attack” or “payload attack”) that may considerably reduce its efficiency.

In embodiments of the present invention, the keyword spotting system assigns the input data and the patterns to multiple different pattern matching algorithms. In one embodiment, the system splits the input data traffic between two or more matching algorithms. In one embodiment, a dominant share of the traffic is handled by one algorithm and smaller traffic shares by the others. The system monitors the algorithms' performance (by evaluating a respective metric) as they process the data to search for a match. The ratio of traffic splitting among the algorithms is dynamically reassigned or adjusted to maximize the overall performance.

In another embodiment, two or more pattern matching algorithms, each assigned to a distinct dictionary, process the input data in parallel. In other words, the patterns are split among the matching algorithms. The input traffic is not split but is rather directed in full to each of the matching algorithms. The dictionaries together include all the patterns to be searched. Again, the algorithms' performance is monitored and a respective metric is evaluated as they process the data. In response to changes in the data characteristics over time, patterns may be dynamically reassigned among the different dictionaries to adjust the corresponding algorithms for maximal overall performance.

The disclosed techniques enable the system to exploit the advantages and avoid the disadvantages of each pattern matching algorithm. The presented embodiments make it possible to handle high-bandwidth traffic with time-varying characteristics, and to search for a large number of patterns that otherwise would not be feasible with limited computing resources. Moreover, the methods and systems described herein are insensitive to pattern matching algorithmic complexity attacks.

System Description

FIG. 1 is a block diagram that schematically illustrates a system 20 for keyword spotting, in accordance with an embodiment that is described herein. System 20 receives communication traffic from a communication network 24, and attempts to detect in the traffic predefined textual phrases, also referred to herein as keywords or patterns. When one or more keywords are detected, the system reports the detection to a user 28 using an operator terminal 32.

System 20 can be used, for example, in an application that detects data leakage from a communication network. In applications of this sort, the presence of one or more keywords in a data item indicates that this data item should not be allowed to exit the network. Alternatively, system 20 can be used in any other suitable application in which input data is searched for occurrences of keywords, such as in intrusion detection and prevention systems, detection of spam in electronic mail (e-mail) systems, or detection of inappropriate content using a dictionary of inappropriate words or phrases.

Although the embodiments described herein refer mainly to processing of communication traffic, the disclosed techniques can also be used in other domains. For example, system 20 can be used for locating data of interest on storage devices, such as in forensic disk scanning applications. Certain additional aspects of keyword spotting are addressed, for example, in U.S. patent application Ser. No. 12/792,796, entitled “Systems and methods for efficient keyword spotting in communication traffic,” which is assigned to the assignee of the present patent application and whose disclosure is incorporated herein by reference. Other applications may comprise, for example, pattern matching in gene sequences in biology.

Network 24 may comprise any suitable public or private, wireless or wire-line communication network, e.g., a Wide-Area Network (WAN) such as the Internet, a Local-Area Network (LAN), a Metropolitan-Area Network (MAN), or a combination of network types. The communication traffic, to be used as input data by system 20, may be provided to the system using any suitable means. For example, the traffic may be forwarded to the system from a network element (e.g., router) in network 24, such as by port tapping or port mirroring. In alternative embodiments, system 20 may be placed in-line in the traffic path. These embodiments are suitable, for example, for data leakage prevention applications, but can also be used in other applications.

Typically, network 24 comprises an Internet Protocol (IP) network, and the communication traffic comprises IP packets. The description that follows focuses on Transmission Control Protocol/Internet Protocol (TCP/IP) networks and TCP packets. Alternatively, however, the methods and systems described herein can be used with other packet types, such as User Datagram Protocol (UDP) packets. Regardless of protocol, the packets searched by system 20 are referred to herein generally as input data.

In the example of FIG. 1, system 20 comprises a Network Interface Card (NIC) 36, which receives TCP packets from network 24. NIC 36 thus serves as an input circuit that receives the input data to be searched. NIC 36 stores the incoming TCP packets in a memory 40, typically comprising a Random Access Memory (RAM). A processor 44 searches the TCP packets stored in memory 40 and attempts to identify occurrences of predefined keywords in the packets.

The predefined keywords or patterns are stored in a patterns dictionary 48. Dictionary 48 may be stored on any suitable storage device. In some embodiments, dictionary 48, or part of it, may be stored in a cache memory (not shown) of processor 44 to increase the access speed by the processor. In some embodiments, dictionary 48 may comprise multiple physically or logically distinct dictionaries.

When processor 44 detects a given keyword in a given packet, it reports the detection to user 28 using an output device of terminal 32, such as a display 56. For example, the processor may issue an alert to the user and/or present the data item (e.g., packet or session) in which the keyword was detected. In some embodiments, processor 44 may take various kinds of actions in response to detecting a keyword. For example, in a data leakage or intrusion prevention application, processor 44 may block some or all of the traffic upon detecting a keyword. User 28 may interact with system 20 using an input device of terminal 32, e.g., a keyboard 60.

The system configuration shown in FIG. 1 is an example configuration, which is chosen purely for the sake of conceptual clarity. Alternatively, any other suitable system configuration can be used. Generally, the different elements of system 20 may be implemented using software, hardware or a combination of hardware and software elements. In some embodiments, processor 44 comprises a general-purpose computer, which is programmed in software to carry out the functions described herein. The software may be downloaded to the computer in optical or electronic form, over a network, for example, or it may, additionally or alternatively, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.

Maximizing Performance by Adaptive Splitting of Traffic

Many algorithms for keyword searching are known in the art. The algorithms may differ in several attributes such as run-time, implementation complexity and average or worst-case behavior. Moreover, their performance may be affected by several factors such as the size of the dictionary and the alphabet, as well as the length of the keywords.

For example, the run time as a function of the input length may be linear in the worst case, as in Aho-Corasick (AC), which also performs better with a small dictionary of short patterns and whose run-time is not sensitive to the pattern length. Alternatively, the run time may be sub-linear on average, as in the Wu-Manber (WM) and the Set Backward Oracle Matching (SBOM) algorithms, which also support a large keyword set as well as a large alphabet. The AC and SBOM algorithms are additionally relatively simple to implement. Some algorithms, such as WM, perform better when searching for a set of only long patterns, as short patterns degrade their performance significantly.

Since under different and changing conditions some algorithms may perform more efficiently than others, the disclosed techniques incorporate more than a single algorithm in a system for keyword searching. Thus, with limited computation resources, the system may dynamically divert the traffic to the most suitable algorithm so as to maximize the overall performance.

FIG. 2 is a block diagram that schematically illustrates an example configuration of processor 44, in accordance with an embodiment of the present disclosure. Input traffic data enters a data splitter 100. In the example embodiment of FIG. 2, the splitter has one input port and two output ports. Processor 44 configures the splitter to extract part or a share of the traffic to each output port. Typically, the input traffic comprises multiple flows of packets, and the splitter directs certain segments of the flows (or flow parts) to one port and other or partially common segments to the other port. The splitter configuration is also referred to herein as a splitting policy.

When dynamically changing the splitting policy, the processor should avoid missing any patterns as a result of the policy change. Processor 44 may use any suitable method to guarantee smooth transition of traffic with no loss of pattern detection. For example, when changing the splitting policy, the processor may direct to each algorithm a sufficient lag of past characters. As another example, the processor may split the traffic on a flow basis, i.e., aggregate and direct all the data of a flow to one algorithm. As yet another example, a respective data segment around the flow cut point may be handled by a third (not shown) algorithm.
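By way of a non-limiting illustration, the "sufficient lag of past characters" option may be sketched as follows. The sketch assumes, as one possible choice not stated in the disclosure, that replaying the last (longest pattern length minus one) bytes before the cut point is sufficient for wildcard-free patterns. The function name handoff_window and its parameters are hypothetical.

```python
def handoff_window(stream: bytes, cut: int, max_pattern_len: int) -> bytes:
    """Return the data the newly selected algorithm should start scanning from.

    Assumption (not from the disclosure): a lag of max_pattern_len - 1 bytes
    before the cut point is enough to catch any pattern straddling the cut.
    """
    lag = max(max_pattern_len - 1, 0)
    start = max(cut - lag, 0)
    return stream[start:]


# A pattern that straddles the cut point is missed without the lag,
# and found when the lag is replayed to the second algorithm.
stream = b"aaaaaaaaaasecretbbbb"   # "secret" occupies bytes 10..15
cut = 13                           # policy change happens mid-pattern
before_cut = stream[:cut]          # seen only by the first algorithm
after_cut_no_lag = stream[cut:]    # naive handoff
after_cut_with_lag = handoff_window(stream, cut, max_pattern_len=6)
```

In this sketch the overlap region may be scanned twice, so a match lying entirely inside the lag could be reported by both algorithms; a practical system would presumably deduplicate such reports.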

Two different matching algorithms denoted ALGORITHM1 104 and ALGORITHM2 108 are assigned input data from the respective output ports of splitter 100. When system 20 starts to receive communication traffic, processor 44 configures the splitter to an initial splitting policy. The processor may select any suitable initial policy. For example, if the initial data characteristics are not available to system 20, the processor may configure the splitter to initially split the data evenly.

In some embodiments, one of the algorithms may be a priori assumed to be the most efficient for the expected input data. For example, the URL of the data source (if available) may indicate the data characteristics. In such embodiments, the processor may initially configure the splitter to direct a dominant share or even all the traffic to the most efficient algorithm (referred to as the primary algorithm). Additionally or alternatively, the processor may get an initial splitting policy from user 28 via terminal 32.

ALGORITHM1 and ALGORITHM2 are configured to search the data accepted from the splitter for occurrences of patterns stored in a pattern dictionary 112. When either algorithm locates a pattern in the data, processor 44 reports the matching event as described with reference to FIG. 1 above.

The performance or efficiency of a matching algorithm may change over time. For example, modifying, adding, or deleting patterns in the dictionary (e.g., by user 28) may reduce the processing complexity of one algorithm and increase the complexity of another algorithm at the same time. As another example, as the characteristics of the input data change over time, the complexity burden on two different algorithms may change in opposite directions.

A performance analyzer 116 monitors the performance, e.g., the efficiency, of each matching algorithm. The efficiency of a matching algorithm can be estimated, for example, by evaluating a respective metric, such as the amount of input data that the algorithm can process per unit time, e.g., the number of processed input bytes per second. Other example performance metrics include the dictionaries' memory size, and the amount of memory needed for flow state machines, i.e., for storing the internal state of the algorithm for each flow that is being analyzed.
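As a minimal sketch of the bytes-per-second metric mentioned above, a per-algorithm meter might accumulate the bytes processed and the time spent processing them. The class name ThroughputMeter and its fields are hypothetical; the elapsed time is passed in explicitly here so that the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class ThroughputMeter:
    """Accumulates work done by one matching algorithm (illustrative only)."""
    bytes_processed: int = 0
    busy_seconds: float = 0.0

    def record(self, n_bytes: int, seconds: float) -> None:
        # Called after the algorithm finishes a data segment.
        self.bytes_processed += n_bytes
        self.busy_seconds += seconds

    def bytes_per_second(self) -> float:
        # No data yet: report zero rather than dividing by zero.
        if self.busy_seconds == 0:
            return 0.0
        return self.bytes_processed / self.busy_seconds


meter = ThroughputMeter()
meter.record(1000, 0.5)
meter.record(3000, 1.5)
```

In a running system the seconds argument would typically come from a monotonic clock such as time.perf_counter() sampled around each search call.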

In some embodiments, each algorithm estimates its own performance and sends it to analyzer 116 for monitoring. Alternatively, the analyzer calculates the performance metric internally. The analyzer may use any suitable method to decide at what points in time to monitor the performance. For example, the analyzer may monitor the performance periodically. The time period may be on the order of a few seconds, or any other suitable time duration. Alternatively, the analyzer may continuously measure the algorithms' performance. Further additionally or alternatively, the analyzer may monitor the performance in response to a change in the dictionary content by the user.

The analyzer uses the monitored performance to decide on an updated splitting policy for splitter 100. For example, the analyzer may derive a proportional splitting policy, i.e., the more efficient an algorithm is relative to the others, the higher the share of the traffic that is reassigned to that algorithm. As another example, the analyzer may derive an absolute splitting policy. For example, the analyzer may compare the performance of each algorithm to a predefined threshold, and direct most of the traffic to the algorithm whose performance relative to the respective threshold is the highest.
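The two example policies above may be sketched as follows, purely for illustration. The sketch assumes the monitored metric is a throughput figure per algorithm; the function names, the dominant_share parameter, and the equal division of the remaining traffic among the non-selected algorithms are assumptions, not details taken from the disclosure.

```python
def proportional_policy(throughput: dict) -> dict:
    """Traffic share of each algorithm is proportional to its throughput."""
    total = sum(throughput.values())
    if total <= 0:
        # No measurements yet: fall back to an even split.
        return {name: 1.0 / len(throughput) for name in throughput}
    return {name: value / total for name, value in throughput.items()}


def threshold_policy(throughput: dict, thresholds: dict,
                     dominant_share: float = 0.9) -> dict:
    """Most traffic goes to the algorithm that best exceeds its threshold."""
    best = max(throughput, key=lambda name: throughput[name] / thresholds[name])
    others = [name for name in throughput if name != best]
    if not others:
        return {best: 1.0}
    rest = (1.0 - dominant_share) / len(others)
    return {name: (dominant_share if name == best else rest)
            for name in throughput}


measured = {"AC": 300.0, "WM": 100.0}
proportional = proportional_policy(measured)
absolute = threshold_policy(measured, {"AC": 600.0, "WM": 100.0},
                            dominant_share=0.75)
```

Note that the two policies can disagree: in the usage above AC has the higher raw throughput, but WM performs better relative to its own threshold and therefore receives the dominant share under the absolute policy.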

As yet another example, the analyzer can instruct the splitter to provide an algorithm with another input data segment, such as a packet, as the algorithm concludes processing a previous input data segment. Alternatively, the processor may use any other suitable method to determine the splitting policy in response to the monitored performance. Typically, the analyzer diverts some of the traffic to each algorithm in order to keep monitoring the performance of all the algorithms.

As yet another example, the analyzer may configure the splitter to direct a suitable data segment at the beginning of a certain flow to both algorithms. The rest of the flow will be directed to the algorithm that performed better on that data segment.

In addition to monitoring the algorithms' performance, analyzer 116 analyzes the characteristics of the input traffic. The analyzer accepts the traffic output from the splitter for analysis. Since the data characteristics may change over time, and since each algorithm may be better tuned to some characteristics, the analyzer may change the splitting policy accordingly. The analyzer may use any suitable method to analyze the input data.

For example, the analyzer may calculate statistical attributes of the data characters. The analyzer can calculate a histogram that counts the number of occurrences of each alphabet symbol in a data segment. In some embodiments, some metadata may accompany the data flow, indicating the flow content and therefore the data characteristics. For example, video, text, or image content may differ considerably in data characteristics. In such embodiments, the analyzer may configure the splitter to direct a flow to the most suitable algorithm according to the accompanying metadata.
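The symbol histogram mentioned above may be computed, for instance, as in the following sketch. The function names are hypothetical, and treating each byte as one alphabet symbol is an assumption made for simplicity.

```python
from collections import Counter


def symbol_histogram(segment: bytes) -> Counter:
    """Count occurrences of each byte value (alphabet symbol) in a segment."""
    return Counter(segment)


def alphabet_size(segment: bytes) -> int:
    """Number of distinct symbols actually observed in the segment."""
    return len(set(segment))


hist = symbol_histogram(b"abca")
```

The observed alphabet size is one of the factors the description names as affecting matching performance, so an analyzer could feed it into its choice of splitting policy.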

The analyzer may analyze the input data at any suitable points in time. For example, the analyzer may periodically or continuously perform the analysis. Additionally or alternatively, the analyzer may perform the analysis when a new data source joins the traffic.

When deciding on an updated splitting policy as described above, analyzer 116 may additionally consider the inherent complexity of the algorithms. For example, the processor may utilize optimization techniques to select a splitting policy that would maximize the overall efficiency (i.e., the total traffic the system can handle per unit time), under overall constrained computation resources. As an example, the analyzer may trade computation time versus memory access time and optimize splitting the traffic among the algorithms accordingly.

Another example that may trigger the processor to change the splitting policy is referred to as an algorithmic complexity attack. A complexity attack is typically designed to push a specific algorithm to its worst-case behavior, by planting carefully selected data patterns in the traffic. Therefore, the performance of a matching algorithm that suffers an attack degrades significantly. Since an attack is designed for a specific algorithm, other algorithms may be much less sensitive to that attack, and would typically maintain high performance.

When one algorithm is attacked, analyzer 116 would sense a significant performance reduction, and the processor may configure the splitter to stop directing any data to that algorithm. Alternatively, the processor maintains a small share of the traffic directed to the algorithm under attack and keeps monitoring the performance. When the attack stops, the processor may again direct a significant share of the traffic to that algorithm.
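The alternative in which a small probing share is kept on the attacked algorithm may be sketched as follows. The detection rule (current throughput falling below a fraction of a baseline), the drop_ratio and probe_share parameters, and the function name are all illustrative assumptions; the disclosure does not specify how a significant performance reduction is quantified.

```python
def demote_if_attacked(shares: dict, baseline: dict, current: dict,
                       drop_ratio: float = 0.5,
                       probe_share: float = 0.05) -> dict:
    """Reduce a suspected-attacked algorithm to a small probing share.

    An algorithm is treated as suspect when its current throughput is below
    drop_ratio times its baseline. Its freed traffic is redistributed among
    the remaining algorithms in proportion to their existing shares.
    """
    suspects = [n for n in shares if current[n] < drop_ratio * baseline[n]]
    healthy = [n for n in shares if n not in suspects]
    if not suspects or not healthy:
        # Nothing to demote, or nowhere to move the traffic.
        return dict(shares)
    healthy_total = sum(shares[n] for n in healthy)
    remaining = 1.0 - probe_share * len(suspects)
    result = {n: probe_share for n in suspects}
    for n in healthy:
        weight = (shares[n] / healthy_total if healthy_total > 0
                  else 1.0 / len(healthy))
        result[n] = remaining * weight
    return result


updated = demote_if_attacked(
    shares={"AC": 0.5, "WM": 0.5},
    baseline={"AC": 100.0, "WM": 100.0},
    current={"AC": 10.0, "WM": 95.0},
    probe_share=0.25,
)
```

Because the demoted algorithm still receives some traffic, its throughput continues to be measured, which lets the analyzer notice when the attack has stopped and restore a larger share.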

The embodiments in FIG. 2 use two matching algorithms and a splitter with two output ports, directing data to each algorithm. Other embodiments, however, may use any number of different matching algorithms and a corresponding suitable data splitter. For example, an embodiment may use three different matching algorithms and a splitter with three output ports.

Maximizing Performance by Splitting Patterns Among Multiple Pattern Matching Algorithms

FIG. 3 is a block diagram that schematically illustrates another example configuration of processor 44, in accordance with another embodiment of the present disclosure. Unlike the configuration of FIG. 2, in FIG. 3 the full input traffic is assigned to both matching algorithms, ALGORITHM1 104 and ALGORITHM2 108. Performance analyzer 116 monitors the algorithms' performance and analyzes the input data characteristics similarly to the methods described with reference to FIG. 2 above. In FIG. 3, ALGORITHM1 and ALGORITHM2 are configured to search for occurrences of patterns stored in respective dictionaries DICTIONARY1 120 and DICTIONARY2 124. Both dictionaries together hold all the patterns that system 20 is configured to search. Typically, although not necessarily, the sets of patterns in DICTIONARY1 and DICTIONARY2 are disjoint.

System 20 can use any suitable method to decide what patterns to initially put in each dictionary. For example, it may be a priori assumed that each algorithm performs more efficiently given a specific set of patterns. As an example, system 20 may assign patterns to algorithms based on the pattern length. For example, in a system that uses the AC and the WM algorithms, the system may assign a relatively small dictionary (preferably residing in a cache memory) with short patterns to the AC algorithm, and a dictionary of only long patterns to the WM algorithm.

Additionally, when using a large dictionary, the internal hash function in the WM algorithm may experience a larger false positive probability due to collisions.
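The length-based initial assignment described above may be sketched as follows. The cutoff value min_long_len and the function name are hypothetical; the disclosure does not state a specific length boundary between "short" and "long" patterns.

```python
def split_by_length(patterns, min_long_len: int = 8):
    """Partition patterns into a short set (e.g., for AC) and a long set
    (e.g., for WM). The length cutoff is an illustrative assumption."""
    short = {p for p in patterns if len(p) < min_long_len}
    long_ = set(patterns) - short
    return short, long_


dictionary1, dictionary2 = split_by_length(
    ["a@b.com", "xyz", "http://example.com/path"], min_long_len=8
)
```

The two resulting sets are disjoint and together contain every input pattern, matching the typical (though not required) arrangement of DICTIONARY1 and DICTIONARY2.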

In some embodiments, a certain matching algorithm may perform better than others when the patterns for search contain wildcard expressions, i.e., a pattern may not be fully defined. In such embodiments, a dictionary with wildcard patterns may be assigned to that superior algorithm.

Additionally or alternatively, user 28 may configure each dictionary with selected patterns via terminal 32. As described below, system 20 automatically adjusts the dictionaries' content on the fly, to maximize the system performance for varying input traffic.

In yet other embodiments, one or more of the algorithms may suffer performance degradation when the dictionary changes on the fly. In such embodiments, new patterns inserted by the user, or patterns moved from another dictionary, may be assigned to a temporary dictionary and an associated algorithm (not shown). Under suitable conditions, patterns from the temporary dictionary may be merged into the algorithm's dictionary.

As described with reference to FIG. 2 above, the characteristics of the data may change over time, and as a result affect the performance of the matching algorithms. Analyzer 116 monitors the algorithms' performance and the characteristics of the input data similarly to the description of FIG. 2 above, by evaluating a respective metric. When the analyzer detects a change in the algorithms' performance and/or in the input data characteristics, it may reassign patterns to the dictionaries to adjust and increase the overall performance. To reassign patterns, the analyzer may move or swap patterns between the dictionaries. As another example, if one algorithm suffers an algorithmic complexity attack, analyzer 116 may move the dictionary patterns whose search is most likely to trigger the attack to the dictionary of the other algorithm.

The embodiments in FIG. 3 use two matching algorithms and two respective dictionaries. Other embodiments, however, may comprise any suitable number of matching algorithms and respective dictionaries. Moreover, in some embodiments a system may be configured to use a smaller number of dictionaries than algorithms. In such embodiments, multiple algorithms may be configured to search for patterns that are stored in one dictionary. For example, in a system that comprises three algorithms and two dictionaries, the first two algorithms may be attached to one dictionary and the third algorithm to the other dictionary.

FIG. 4 is a flow chart that schematically illustrates a method for efficient keyword searching, in accordance with an embodiment of the present disclosure. The method begins with system 20 receiving patterns dictionaries at a patterns input step 200. System 20 receives packets (referred to as input data) from network 24 via NIC 36, and stores the packets in RAM 40, at a data input step 204.

Processor 44 searches the packets using algorithms 104 and 108 (using dictionary 112 or dictionaries 120 and 124) at a searching step 208. Processor 44 checks whether a match is found between a portion of the input data and any of the textual phrases (patterns) of the dictionaries, at a matching step 212. If a match with a respective pattern is found, processor 44 reports the match event to user 28 using operator terminal 32, at an output step 216.

If no match is found, or following a match report, the method proceeds to an analyzing step 220. At step 220 the processor monitors and analyzes the performance of the matching algorithms ALGORITHM1 104 and ALGORITHM2 108. Still at step 220, the processor additionally analyzes the characteristics of the input data.

The processor checks if the traffic splitting policy should be changed, at a check analysis step 224. If the analysis of the algorithms' performance and/or traffic characteristics indicates that changing the splitting policy will increase the overall performance, the processor applies an updated splitting policy to data splitter 100 at an adjusting step 228. Otherwise, the splitting policy is maintained and the processor loops back to step 204 above, in which system 20 receives subsequent input data.

Additionally or alternatively, at step 224 above, the processor checks if the analysis of the algorithms' performance and/or data characteristics indicates that the overall performance may increase by moving or swapping patterns between DICTIONARY1 120 and DICTIONARY2 124. If the check result is positive, processor 44 adjusts the dictionaries' content by moving or swapping patterns. After adjusting the dictionaries, or if there is no need for such adjustment, the processor loops back to step 204.

It will be appreciated that the embodiments described above are cited by way of example, and that the present disclosure is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present disclosure includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.

1. A method for identifying textual phrases of interest in input data, the method being performed by an apparatus comprising a network interface card (NIC) and a processor, the method comprising: receiving, by the NIC, input communication traffic to be searched for occurrences of a set of patterns, wherein each pattern of the set of patterns comprises one or more textual phrases; configuring, by the processor, a data splitter in accordance with an initial splitting configuration policy to assign the input communication traffic and the patterns to multiple different pattern matching algorithms, in which certain segments of the input communication traffic are assigned to a first pattern matching algorithm and certain segments are assigned to a second pattern matching algorithm, wherein at least a first pattern of the set of patterns is assigned to the first pattern matching algorithm and wherein at least a second pattern of the set of patterns is assigned to the second pattern matching algorithm; executing, by the processor, the first and second pattern matching algorithms to identify occurrences of textual phrases from the respective first and second patterns in the communication traffic, wherein the execution comprises the first pattern matching algorithm searching within certain segments for the one or more textual phrases of the first assigned pattern and comprises the second pattern matching algorithm searching within certain segments for the one or more textual phrases of the second assigned pattern; monitoring, by the processor, performance of the first and second pattern matching algorithms by evaluating a predetermined metric for each of the first and second pattern matching algorithms; generating, by the processor, for the data splitter, based on the monitored performance, an updated splitting policy configuration that reassigns which segments of the input communication traffic are assigned to which pattern matching algorithm and/or which pattern of the set of patterns is assigned to which pattern matching algorithm; and configuring, by the processor, the data splitter in accordance with the updated splitting policy configuration.
2. The method according to claim 1, wherein evaluating the predefined metric comprises assessing a performance measure of the pattern matching algorithms.
3. The method according to claim 1, wherein evaluating the predefined metric comprises assessing a characteristic of the input communication traffic.
4. The method according to claim 1, wherein assigning the input communication traffic and the patterns comprises applying each of the pattern matching algorithms to search a respective subset of the input communication traffic for the occurrences of all the patterns.
5. The method according to claim 4, wherein reassigning the input communication traffic and the patterns comprises reassigning a portion of the input communication traffic from the first pattern matching algorithm to the second pattern matching algorithm.
6. The method according to claim 1, wherein assigning the input communication traffic and the patterns comprises defining one of the pattern matching algorithms as a primary algorithm and assigning a majority of the input communication traffic to the primary algorithm, and wherein reassigning the input data and the patterns comprises redefining another of the pattern matching algorithms to serve as the primary algorithm and shifting the majority of the input communication traffic to the redefined primary algorithm.
7. The method according to claim 1, wherein assigning the input communication traffic and the patterns comprises applying each of the pattern matching algorithms to search all the input communication traffic for the occurrences of a respective subset of the patterns.
8. The method according to claim 1, wherein evaluating the metric comprises evaluating at least one metric type selected from a group of types consisting of: a volume of the input communication traffic processed by a given pattern matching algorithm per unit time; a memory size occupied by the assigned patterns; and the memory size used for maintaining state machines of respective flows of the input communication traffic.
9. An apparatus for identifying textual phrases of interest in input data, comprising: a network interface card (NIC) configured to receive input communication traffic that is to be searched for occurrences of a set of patterns, wherein each pattern of the set of patterns comprises one or more textual phrases; and a processor configured to: configure a data splitter in accordance with an initial splitting configuration policy to assign the input communication traffic and the patterns to multiple different pattern matching algorithms, in which certain segments of the input communication traffic are assigned to a first pattern matching algorithm and certain segments are assigned to a second pattern matching algorithm; execute the first and second pattern matching algorithms to identify occurrences of textual phrases from the respective first and second patterns in the communication traffic, wherein the execution comprises the first pattern matching algorithm searching within certain segments for the one or more textual phrases of the first assigned pattern and comprises the second pattern matching algorithm searching within certain segments for the one or more textual phrases of the second assigned pattern; monitor performance of the first and second pattern matching algorithms by evaluating a predetermined metric for each of the first and second pattern matching algorithms; generate, for the data splitter, based on the monitored performance, an updated splitting policy configuration that reassigns which segments of the input communication traffic are assigned to which pattern matching algorithm and/or which pattern of the set of patterns is assigned to which pattern matching algorithm; and configure the data splitter in accordance with the updated splitting policy configuration.
10. The apparatus according to claim 9, wherein the processor is configured to evaluate the predefined metric by assessing a performance measure of the pattern matching algorithms.
11. The apparatus according to claim 9, wherein the processor is configured to evaluate the predefined metric by assessing a characteristic of the input communication traffic.
12. The apparatus according to claim 9, wherein the processor is configured to assign the input communication traffic and the patterns by applying each of the pattern matching algorithms to search a respective subset of the input communication traffic for the occurrences of all the patterns.
13. The apparatus according to claim 12, wherein the processor is configured to reassign the input communication traffic and the patterns by reassigning a portion of the input communication traffic from the first pattern matching algorithm to the second pattern matching algorithm.
14. The apparatus according to claim 9, wherein the processor is configured to define one of the pattern matching algorithms as a primary algorithm and to assign a majority of the input communication traffic to the primary algorithm, and to reassign the input communication traffic and the patterns by redefining another of the pattern matching algorithms to serve as the primary algorithm and to shift the majority of the input communication traffic to the redefined primary algorithm.
15. The apparatus according to claim 9, wherein the processor is configured to assign the input communication traffic and the patterns by applying each of the pattern matching algorithms to search all the input communication traffic for the occurrences of a respective subset of the patterns.
16. The apparatus according to claim 9, wherein the processor is configured to evaluate the metric by evaluating at least one metric type selected from a group of types consisting of: a volume of the input communication traffic processed by a given pattern matching algorithm per unit time; a memory size occupied by the assigned patterns; and the memory size used for maintaining state machines of respective flows of the input communication traffic.
17. A non-transitory computer readable media having instructions stored thereon for identifying textual phrases of interest in input data that, when executed by a computing system, cause the computing device to at least: receive input communication traffic that is to be searched for occurrences of a set of patterns, wherein each pattern of the set of patterns comprises one or more textual phrases; configure a data splitter in accordance with an initial splitting configuration policy to assign the input communication traffic and the patterns to multiple different pattern matching algorithms, in which certain segments of the input communication traffic are assigned to a first pattern matching algorithm and certain segments are assigned to a second pattern matching algorithm; execute the first and second pattern matching algorithms to identify occurrences of textual phrases from the respective first and second patterns in the communication traffic, wherein the execution comprises the first pattern matching algorithm searching within certain segments for the one or more textual phrases of the first assigned pattern and comprises the second pattern matching algorithm searching within certain segments for the one or more textual phrases of the second assigned pattern; monitor performance of the first and second pattern matching algorithms by evaluating a predetermined metric for each of the first and second pattern matching algorithms; generate, for the data splitter, based on the monitored performance, an updated splitting policy configuration that reassigns which segments of the input communication traffic are assigned to which pattern matching algorithm and/or which pattern of the set of patterns is assigned to which pattern matching algorithm; and configure the data splitter in accordance with the updated splitting policy configuration.
18. The non-transitory computer readable media according to claim 17, wherein the computing device is configured to assign the input communication traffic and the patterns by applying each of the pattern matching algorithms to search a respective subset of the input communication traffic for the occurrences of all the patterns.
19. The non-transitory computer readable media according to claim 18, wherein the computing device is configured to reassign the input communication traffic and the patterns by reassigning a portion of the input communication traffic from the first pattern matching algorithm to the second pattern matching algorithm.
20. The non-transitory computer readable media according to claim 17, wherein the computing device is configured to evaluate the metric by evaluating at least one metric type selected from a group of types consisting of: a volume of the input communication traffic processed by a given pattern matching algorithm per unit time; a memory size occupied by the assigned patterns; and the memory size used for maintaining state machines of respective flows of the input communication traffic.